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Abstract: The control and risk assessment in complex information systems require to take 
into account extremes arising from nodes with large node degrees. Various sampling techniques 
like a Page Rank random walk, a Metropolis-Hastings Markov chain and others serve to collect 
information about the nodes. The paper contributes to the comparison of sampling techniques in 
complex networks by means of the first hitting time, that is the minimal time required to reach a 
large node. Both the mean and the distribution of the first hitting time is shown to be determined 
by the so called extremal index. The latter indicates a dependence measure of extremes and also 
reflects the cluster structure of the network. The clustering is caused by dependence between 
nodes and heavy-tailed distributions of their degrees. Based on extreme value theory we estimate 
the mean and the distribution of the first hitting time and the distribution of node degrees by 
real data from social networks. We demonstrate the heaviness of the tails of these data using 
appropriate tools. The same methodology can be applied to other complex networks like peer- 


to-peer telecommunication systems. 
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1. INTRODUCTION 

Modern complex networks like online social (OSN), peer- 
to-peer (P2P) and content-centric networks and the world 
wide web (WWW) are in general nonlinear information 
systems. The control and risk assessment in complex in¬ 
formation systems require to take into account extremes 
arising from nodes with large node degrees. The giant 
extremal nodes impact on the work and the development 
of the whole system more than small nodes. The inves¬ 
tigation of all nodes is costly since the networks are very 
large. Thus, sampling techniques via crawling are proposed 
as tools to collect node samples. Uniform and random 
walk sampling, PageRank (Avrachenkov et al. [2010]), 
non-backtracking random walk with re-weighting (NBRW) 
(Lee et al. [2012]), the random walk Metropolis and 
Metropolis-Hastings algorithms (Metropolis et al. [1953], 
Hastings [1970]) give examples of possible approaches. 
The giant nodes surrounded by smaller nodes build clus¬ 
ters of connected nodes. Within the clusters, node degrees 
may exceed sufficiently large thresholds {u n }. By the ex¬ 
treme value theory the node degrees exceeding u n such 
that u n —» oc as sample size n —» oo build a compound 
Poisson process, Beirlant et al. [2004], Leadbetter [1983]. 
Roughly speaking, tops of the clusters become independent 
and determine independent clusters of the network. It is 
visible for a Twitter network taken as an example in Fig. 
1. With this respect, it is important to evaluate the first 
hitting time, that is the minimal time required to reach a 
large node. This allows us to disseminate information and 
advertisement more effectively and to upload it directly to 
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analysis, first hitting time, heavy-tailed 


top-nodes of such clusters. 

The extremal index is a key characteristic of cluster 



Fig. 1. Clusters of nodes on a Twitter map Leaflet [cited 
December 2014]. 

extremes. Its reciprocal approximates the mean cluster 
size, i.e. the number of exceedances of the threshold per 
cluster, Leadbetter [1983]. It determines the first hitting 
time and its distribution and mean, Roberts et al. [2006], 
Markovich [2015]. The extremal index allows us to repre¬ 
sent the distribution of the maximal node degree (3) and 
its quantiles. Thus, the estimation of the extremal index 
is one of the subjects of the paper. 

Another problem that is related to the extremal index 
estimation is given by the detection of the heaviness of 
tails of the node degree distribution. The presence of heavy 
tails may dramatically impact on the first hitting time of 
the sampling technique and its effectiveness. The power 
law distribution which has asymptotically a Pareto tail 







is widely applied to model the node degree distribution, 
Litvak et al. [2007], Newman [2006]. However, such models 
may fit the distributions unsatisfactory and do not satisfy 
nonparametric tests, Litvak et al. [2007]. the reason is that 
distributions may include mixtures of heavy- and light- 
tailed distributions. This applies also to the popularity of 
a content transmitted through content-centric networks, 
Imbrenda et al. [September 24-26, 2014]. 

We focus here on the Metropolis algorithm which con¬ 
structs nonlinear time-reversible Markov chains with a 
given, desired stationary distribution i r(x), Andrieu et al. 
[2011]. For a given graph of the network there are poten¬ 
tially many irreducible 1 Markov chains (or random walks) 
preserving the same stationary distribution 7r. To select 
the best one, criterion like the mixing time (Lee et al. 
[2012]) and related to it the second largest eigenvalue of 
the associated random walk transition probability matrix, 
Avrachenkov et al. [2010] as well as the convergence rate 
of a Markov chain (Mengersen and Tweedie [1996]) are 
usually applied. 

On this respect, heavy-tailed distributions ir(x) of a 
Metropolis random walk generate specific problems. The 
convergence rate of such Markov chain is not a geometric 
but a polynomial one, Roberts and Smith [1994]. This 
leads to an infinitely long first hitting time to reach a node 
with a large degree. The latter time is finite in case of a 
light-tailed i t(x). 

The objectives of the paper are the following. Summarizing 
theoretical achievements regarding extremes of stochas¬ 
tic sequences, we apply them to real data sets of social 
networks. We make conclusions regarding the possible 
effectiveness of Markov chains generated by the Metropolis 
algorithm with a heavy-tailed stationary distribution i t(x). 
To this end (1) we evaluate the node degree distribution 
by real data and test it regarding the presence of heavy 
tails by appropriate tools, (2) we investigate extreme value 
statistics such as the extremal index and the first hitting 
time, namely, its mean and distribution by real data of 
social networks. 

The paper is organized as follows. In Section 2.1 definitions 
and related theoretical results concerning the extremal 
index and the first hitting time are given. The power law 
model is given in Section 2.2. The random walk Metropolis 
algorithm is described in Section 2.3. Applications to real 
data sets of two social networks and the estimation of 
the extremal index and the first hitting time are given 
in Section 3. Conclusions are stated in Section 4. 

2. DEFINITIONS AND RELATED WORK 

2.1 The extremal index and the first hitting time 

Let {X n } n >i be a stationary sequence with marginal 
distribution function F{x) and M n = max{Xi,..., X n }. 
One may involve that {X n } is a sequence of node degrees. 

Definition 1. Leadbetter [1983] The stationary sequence 
{X n } n >i is said to have the extremal index 0 E [0,1] if 
for each constant 0 < r < oo there is a sequence of real 
numbers (thresholds) u n = u n (r) such that 

lim n( 1 — F(u n )) = r and (1) 

n—»• oo 

1 This means that every node in the network is reachable in a finite 
time with a positive probability. 


lim P{M n < u n } = e t6 (2) 

n—^oo 

hold. 

0 is a dependence measure of extremes in the following 
sense. For a sufficiently large sample size n and the 
threshold sequence {u n } 

P{M n < u n } = F ne (u n ) + o(l) (3) 

holds. If {X n } are independent random variables then 
0 = 1 as far as 0 « 0 corresponds to a strong dependence. 
If 7 t(x) behaves as a power function, i.e. if 

7 r(x) ~ cx~ a (4) 

for some c > 0 and a > 0, the Metropolis algorithm 
gives an example of the pathological case 9 = 0 of total 
dependence, Roberts et al. [2006]. 

The extremal index 0 of {X n } indicates the first hitting 
time T n to exceed level u n , Roberts et al. [2006]. 
Definition 2. The first hitting time T* = T*(u n ) of the 
threshold u n is determined by the following expression 
P{T*(u n ) = j + 1} = P{Mj < u n ,Xj +1 > 7i n }, 

j = 0,1, 2,..., Mo = — oo, Markovich [2015]. 

Since u n is selected according to (1) it follows that 
P{X n > u n } is asymptotically equivalent to 1/n. Notice, 
that it holds 

P {M k < u n } = P{T* > k}. 

Hence, we get 2 

P{T*/n > k/n} ~ e ~ ekr{Xn>Un} ~ e~ ek/n 

and it follows 

lim P(T*/n >x) = e~ ex 

n—»• oo 

for positive x. It follows 

lim E(T*/n) = 1/9. (5) 

E(T*/n) determines the mean first hitting time to find a 
node with the degree exceeding a sufficiently large level. 
(5) implies, that the smaller 0, the longer it takes to reach 
a node with a large degree. 

The latter result is specified in Markovich [2015]. More 
exactly, it holds 

lim p n ET*(x Pn ) = l/e 3 , (6) 

n—t oo 

where the quantile x Pn of the level 1 — p n of the underlying 
sequence 3 {X n } is taken as the threshold u n . Since p n ~ 
r/n according to (1) the result (6) does not contradict (5). 
For example, if 0 = 1/2 then it will take eight times longer 
to find the node degree Di = u n than to arrive at extreme 
levels by an independent sequence. 

The normalized distribution of T*(x Pn ) is derived to be 
geometric with the probability equal to p n 0 , i.e. 

P{T*(x Pn ) = j} ~ ^(1 - PnOy- 1 (7) 

holds under a specific mixing condition, Markovich [2015]. 

2 The / ~ / means asymptotically equal to or f(x) ~ g(x) 
f(x)/g(x) —» 1 as x —> a, x E M. 

3 This implies that P{Xi > x Pn } = p n holds. 




2.2 Power law 


We consider node degrees of real social networks corre¬ 
sponding to indirected graphs. In this case, in- and out- 
degrees coincide. The node degree distributions are be¬ 
lieved to follows power laws (4) and for Web a = 1.1 
for in-degree and PageRank, and a « 2 for out-degree, 
Volkovich et al. [2009], Volkovich and Litvak [2010]. 
Despite the node degree is a discrete random variable, it 
is a common approach to consider a simpler continuous 
Pareto analogue of its distribution, Clauset et al. [2009]. 
The latter belongs to the class of heavy-tailed regularly 
varying distributions with the tail function 

P{X > x} = £(x)x~ a , x > 0, (8) 

where a denotes the tail index responsible for the heaviness 
of the tail, and £(x) is a slowly varying function, that is, 
for x > 0, £(tx)/£(t) -G 1 as t —>> oo. In practice, the latter 
model may fit the tail of the distribution but unlikely the 
body of the distribution. Usually, we do not know £(x) of 
the appropriate model. It could be any positive constant 
or logarithm. The model (8) is sensitive to the estimation 
of the tail index <a, Volkovich et al. [2007]. 

2.3 Metropolis algorithm 

The Metropolis algorithm generates the following Markov 
chain with stationary density ir(x) 

Xi+i = Xi + Zi+il{Ui+i > a(Xi, Xi + Zi+i)} 
for integer i > 0, where 

( \ = f min{7r(j/)/7r(a;), 1}, if tt(x) > 0, 
v ’ y) { 1, if ir (x) = 0 

is the acceptance probability to move from Xi to Xi + 
Zi+i and Xi +1 = Xi otherwise, Roberts et al. [2006]. 
The Zi has an arbitrary proposal density q(x) and Ui is 
a uniformly distributed random variable on (0,1). Here 
{Zi} and {Ui} are independent sequences of independent, 
identically distributed random variables, independent of 
Xq. The starting point Xo can be selected arbitrary. The 
Metropolis algorithm is a special case of the Metropolis- 
Hastings algorithm with a symmetric proposal density. 
Samplings may be compared by rates of convergence to 
7 t(x) of the transition probabilities of a corresponding 
Markov chain 

P n (x,A) = P{X n G A\Xq = x}, n G Z+, 

to fall after n steps to a Borel set A. 

Definition 3. The Markov chain is called geometrically 
ergodic if there exists p > 1 such that 

lim p n ||P{x, •} — 7t(x )(I = 0, 

n—»• oo 

where \\p\\ = supy.|^| <1 |/i(/)| is a total variation norm for 
a signed measure p. 

Definition 4- A Markov chain has polynomial convergence 
rate v if 

v = sup{J : lim n 5 \\ P{x, •} — 7r(x)|| = 0}. 

n—7>oo 

It is remarkable that the Metropolis Markov chain may 
have a geometric rate if i t(x) is a light-tailed distribution, 


Mengersen and Tweedie [1996] and, a polynomial rate if 
7 t(x) is heavy-tailed, Jarner and Roberts [2007]. 

The convergence rate strongly depends on q(x) that can 
be selected. It is derived for power law that 

7r(x) = £(x)/x 1+r , r > 0, (9) 

the polynomial convergence rate of the Metropolis random 
walk may be faster (v = r/rj) if 

q(x) = £ q (\x\)/\x\ 1+v , 0 < 77 < 2, (10) 

holds and slower (v = r/2) if q(x) has a finite variance 
and 0 < lim inf^oo £ q (x) < limsup^^ £ q (x) < oo 
holds, Jarner and Roberts [2007]. Here, £(x) and £ q (x) are 
normalized slowly varying functions. 4 Positive constants, 
functions logx or (pi/p 2 (x)) s , where p\(x) and P 2 (x) are 
polynomials of the same order, and s is a real number pro¬ 
vide examples of £(x). Functions (pi/p 2 (x)) s and positive 
constants give examples of £ q (x), but not logx. 

For a Metropolis algorithm there is a simple relation be¬ 
tween the geometric ergodicity and the extremal index #, 
Roberts et al. [2006]. Namely, one should check the value 
of r] in the limit 

lim (log7r(x)) / = —rj 

x^-oo 

proposed in Mengersen and Tweedie [1996]. The Metropo¬ 
lis algorithm is geometrically ergodic if p > 0. If p = 0 
then 0 = 0. It was derived that if i t(x) is a power law then 
the geometric ergodicity fails and 6 = 0 holds. Hence, the 
mean first hitting time of the Metropolis random walk is 
infinite in this case. 

Selecting the proposal distribution q(x) of a Metropolis 
algorithm, the tail index a = 1 + p is sufficient to find the 
appropriate polynomial convergence rate of a Metropolis 
Markov chain to its stationary distribution ir(x). 

3. MODELING OF REAL DATA 

The Enron email and DBLP networks taken from Leskovec 
and Krevl [2014] are investigated. We aim first a checking 
the power law model (8) for these date sets. 

Let Xi, ..., X n be a random sequence of underlying node 
degrees with the distribution function F(x) and the den¬ 
sity /(x), and ..., Aq n ) be the corresponding order 

statistics. 

3.1 Heavy-tail detection 

It follows from the previous section, that it is important 
to detect the heaviness of the distribution tail and also to 
estimate the tail index which shows how heavy is the tail. 
To this end, we evaluate the mean excess function. It is 
determined by 

e(u)mE(X -u\X >u), 

and 

n n 

e n {u) = ~ «)1 {Xi > u}/J2 HXi > u} 

z—1 i=l 

is the sample mean excess function over the threshold u. 
Here, 1{A} denotes the indicator function of the event A. 

4 This implies that for any 8 > 0 there exists K > 0 such that for 
y > x > K i(y)/i(x) < ( y/x ) 5 . 





For heavy-tailed distributions e(u) tends to infinity. Par¬ 
ticularly, in case of the Pareto distribution it increases 
linearly. For light-tailed distributions e(u) tends to zero 
and it is constant for exponential distribution, Embrechts 
et al. [1997], Markovich [2007]. 

In Fig. 2 one can see plots of the mean excess function 




Threshold, u 


Fig. 2. Mean excess function of the Enron email data (top) 
and the DBPL data (bottom). 

{(u,e(u)) : X(i) < u < X( n )}, both for Enron and DBPL 
data sets. We may conclude that both data sets are heavy¬ 
tailed. Due to a linearity of the Enron-plot one may suggest 
that the Pareto model can be appropriate for the Enron 
email data. For DBPL data one cannot expect stationarity 
since the linear curve is changed by a nearly constant 
line. Hence, we may think that the distribution contains 
a mixture of an exponential and Pareto distributions. For 
large u the plots look misleading due to rare observations 
exceeding such high thresholds. 

The power law density may be determined by 

fpl(x) = (X (n _ fe) ) a Q:a; _ " _1 , (11) 

where f£° f p i(x)dx = 1 holds. 

The reciprocal of the tail index 7 = 1/a may be estimated 
by Hill’s estimator (Hill [1975]) 

k 

Tr H (n,k) = l/fc^ 1 °gX( n _ i+ i ) — log X( n _ k ) (12) 

i= 1 

and by the Ratio estimator. The latter is a generalization 
of the Hill’s one in a sense that instead of X ( n _ k ) in 
(12) an arbitrary threshold x n > 0 is used, Markovich 
[2007]. Both estimators may be applied to dependent data 
such as Markov chains, Novak [2002]. Consistency of Hill’s 
estimator has been derived in Resnick and Starica [1998] 
for the m-dependent heavy-tailed stationary sequences. 
We apply also the Moment estimator which is a function 
of the Hill’s estimator 

7 n,k = l H i n i k ) + 1 - 0.5 (1 - (7 H (n, k)) 2 /s n ,k )) 1 , 


where S n>k = (1/fc) Ei=i (logX (n _ i+1) - logX (n _ fe) ) 2 . A 
survey of other estimators can be found in Markovich 
[2007] among others. 

The number of the largest order statistics k used in all 
estimators is estimated by a double bootstrap method, 
Danielsson et al. [2001]. In Table 1 estimated values of 
the tail index for the Enron email and the DBLP data are 
shown. The number of bootstrap re-samples B = 500 is 
used. 

It is shown that all estimates of a are slightly larger 

Table 1. Estimation of the tail index a by 
real data sets averaged over 500 bootstrap re¬ 
samples. 


Data 

Sample 

size 

k 

Hill 

Ratio 

Moment 

Enron 

DBPL 

36692 

425957 

1659 

2589 

1.337 

1.028 

1.2182 

1.277 

1.023 

1.657 


than 1 but smaller than 2. This implies the infinite 
variance of the node degree distribution according to 
properties of regularly varying distributions (Breiman’s 
theorem), Embrechts et al. [1997], Markovich [2007]. The 
fact that not all moments are finite confirms the heavy¬ 
tailed distributions of both underlying data sets. Since the 
tail index is larger for the Enron data, it follows from (8) 
that its distribution has lighter tail than the DBPL data. 




Fig. 3. The histogram and the power law density (11) for 
the Enron email data with tail index a = 1.337 (top) 
and for the DBPL data with a = 1.277 (bottom). 

The comparison of the histogram and the power law den¬ 
sity (11) for both data sets is shown in Fig. 3. One may 
conclude that the power law fits the Enron email data but 
not the DBPL data. This is in agreement with the mean 
excess function. 

One cannot use the goodness-of-fit tests like Kolmogorov- 
Smirnov or von Mises-Smirnov tests to check the hypoth¬ 
esis regarding the power law. The reason is that the latter 























tests require samples of independent random variables. 
However, due to links between nodes, the node degrees 
are dependent. 

3.2 Extremal index estimation 


To estimate the extremal index 0, we use the intervals 
estimator 


0 n {u) = | 
where 


min(l, (/u)), if max{7} : 1 <i < N — 1} < 2, 
min(l, §n(u)), if max{T^ : 1 < i < N — 1} > 2, 


oh( u ) = 


and 


0 2 n (u) = 


2(EiLT 1 rQ 2 

( N - 1) Eili 1 T i ’ 


2(E^ 1 m-i)) 2 


proposed by Ferro and Segers [2003]. Here, 

n 

n = J2 1 (Xi > «) 




is a number of exceedances of u at time epochs 1 < Si < 
• • • < Sn < n and the interexceedance times are given by 
Ti = Si+i — Si. The intervals estimator does not require 
the selection of any parameter apart of u and demonstrates 
a good accuracy. In contrast, well-known nonparametric 
blocks and runs estimators require the size of blocks as an 
additional parameter to u, Beirlant et al. [2004]. 

In Fig. 4 the intervals estimates are shown for both data 
sets. The appropriate values of 0 are selected correspond¬ 
ing to a stability interval of the curve (u, 6 n (u)). Since 1/6 
approximates the mean cluster size of exceedances of the 
thresholds, we may conclude that the Enron email data 
contains smaller clusters with 4 — 5 nodes on average and 
the DBPL data 5 — 6 nodes on average. 

3.3 First-hitting-time estimation 

Using the obtained estimates 0 G {0.22, 0.15}, we evaluate 
the mean of the first hitting time and its distribution by 
means of (6) and (7), respectively. In Fig. 5 and 6 we use 
the following approximations 

E(T*(x Pn )) « l/(p n e 3 ) 

and 

0 2 P{T*{x Pn ) = j} * p n 0( 1 - p n oy-\ (13) 




Fig. 4. Intervals estimate of 0 for the Enron email data 
(top) and the DBPL data (bottom): 0 = 0.22 and 6 = 
0.15 are selected, respectively, as values corresponding 
to stability intervals of the curves. 



Fig. 5. The mean first hitting time of the Enron email 
and the DBPL data against the quantile level 1 — p: 
larger 1 — p correspond to larger quantiles x p taken as 
thresholds and longer first hitting times. 

3.4 Polynomial convergence rate of the Metropolis random 
walk 


respectively, that are valid for sufficiently large sample size 
n. 

The 95% quantiles x p (p = 0.05) of the node degrees were 
used as thresholds. 

Thus, we may conclude that for the DBPL data the mean 
time required to reach a node with degree larger than 
u = x p is longer than for the Enron email data. This 
reflects on the distributions, too. The distribution of the 
first hitting time of the DBPL data has heavier tail than 
the one of the Enron data. 


Using (9) and (10) as well as the tail index estimates, one 
can easily calculate the polynomial convergence rate of the 
Metropolis random walk. For example, for a = 1.337 w.r.t. 
the Enron data we get from (9) that r = 0.337. To get 
the largest polynomial rate v we select p in (10) as small 
as possible within the interval (0,2). For p = 0.01 we get 
v = 33.7. Similarly, for a = 1.028 w.r.t. the DBPL data we 
get r = 0.028 and for the same p we obtain v = 2.8. This 
implies, that for the DBPL data whose tail distribution 
is heavier than the tail of the Enron, we get the slower 
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Fig. 6. The geometric model (13) of the normalized distri¬ 
bution 6 2 P{T*(x p ) = j} of the first hitting time of 
the Enron email and the DBPL data corresponding 
to 95% quantiles as thresholds. 

polynomial convergence rate of the Metropolis random 
walk. Hence, the sampling by means of the Metropolis 
random walk could be more effective for the Enron email 
data rather than for the DBPL data. 

4. CONCLUSIONS 

We propose to evaluate the optimality of samplings in com¬ 
plex networks using new measures such as the extremal 
index, the distribution of the first hitting time and its 
mean. Considering real data of social networks we conclude 
that a heavier tail of the node degree distribution leads to 
(1) larger node clusters around the giant nodes, (2) slower 
convergence rate of the Metropolis random walk, and (3) 
a longer first hitting time to reach a large node. 
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